feat!: multilingual text-to-speech by IgorSwat · Pull Request #1134 · software-mansion/react-native-executorch

IgorSwat · 2026-05-08T14:23:32Z

Description

Introduces major changes to the text-to-speech module based on the Kokoro model, including:

Multilingual text-to-speech - a set of complete pipelines & voices for different languages. A complete list of (currently) supported languages can be found below.
Improved phonemization & speech quality - using a neural phonemization model as a fallback for the old lexicon-based phonemization significantly improves speech quality, particularly for non-standard, out-of-dictionary words.
Timestamp-based audio cutting - an improved postprocessing algorithm eliminates artifacts introduced by the .pte model, resulting in cleaner, more natural speech.
API changes - the API is prepared for voice cloning and custom, fine-tuned versions of the Kokoro model.

Current status of supported languages:

🇺🇸 American English: ✅
🇬🇧 British English: ✅
🇫🇷 French: ✅
🇪🇸 Spanish: ✅
🇵🇹/🇧🇷 Portuguese: ✅
🇮🇹 Italian: ✅
🇵🇱 Polish: ✅
🇩🇪 German: ✅
🇮🇳 Hindi: ✅
🇯🇵 Japanese: ❌ (coming soon)
🇨🇳 Mandarin Chinese: ❌ (coming soon)

Introduces a breaking change?

Yes
No

There are 2 major breaking changes introduced by this PR:

Changed the "synthesis from phonemes" API.

Old API:

 const audioData = await tts.forwardFromPhonemes({
   phonemes:
     'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
 });

New API:

const audioData = await tts.forward({
  text:
    'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
  phonemize: false, // Disables phonemization and treats text as phonemes
});

Changed the predefined model and voice setups. Model files and voice/phonemization files are now bundled together, because languages such as Polish and German have fine-tuned model weights.

Old API:
```
const model = useTextToSpeech({
  model: KOKORO_MEDIUM,
  voice: KOKORO_VOICE_AF_HEART,
});
```
New API:
```
const model = useTextToSpeech(KOKORO_AMERICAN_ENGLISH_FEMALE_HEART);
```

Type of change

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Documentation update (improves or adds clarity to existing documentation)
Other (chores, tests, code style improvements etc.)

Tested on

iOS
Android

Testing instructions

Play around with the demo speech apps.

Unit tests for RNE-specific code will be added later on.
The Phonemis package has its own wide range of unit tests (see the Phonemis repo).

Screenshots

Related issues

#712

Checklist

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings

Additional notes

msluszniak

You should also update the code in the documentation, and the documentation in general. Also address the lint warnings; there are plenty of them that you need to add to the cspell ignore list.

msluszniak · 2026-05-08T15:30:49Z

Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.

…e type aliases TypeDoc emits `export type` declarations under `06-api-reference/type-aliases/`, not `06-api-reference/interfaces/`. The links in useTextToSpeech.md pointed at the interfaces/ paths, which never get generated for these names, breaking the Docusaurus build (`onBrokenLinks: 'throw'`).

- tests/CMakeLists.txt: build phonemis from source (add_subdirectory) and propagate its include dir to rntests_core. The previous IMPORTED STATIC pointed at a libphonemis.a that nothing builds.
- FrameTransformTest, ObjectDetectionTest, InstanceSegmentationTest: update bbox member access for #1130's BBox refactor (.x1/.y1/.x2/.y2 → .p1.x/.p1.y/.p2.x/.p2.y).
- PoseEstimationTest: keypoint type became float in #1130; update the static_assert from int32_t to float.
- FrameTransformTest: make the three Right_* tests platform-aware. Production inverseRotateBbox/inverseRotatePoints are a no-op on Android for Right (front-cam upright portrait); rotateFrameForModel rotates CW on Android vs CCW on iOS. Tests now have #if defined(__APPLE__) branches matching production.
- SpeechToTextTest: GTEST_SKIP TranscribeReturnsValidChars with a TODO (known-failing on this branch, needs separate investigation).
- run_tests.sh: fix two stale Hugging Face URLs (fsmn-vad and yolo26n-pose filenames had changed upstream, causing wget to 404 and silently abort the script).

msluszniak

Please make sure that iOS is also tested, since I don't have an iOS device for testing.

barhanc

I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi alphabet (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

msluszniak · 2026-05-19T18:23:04Z

> I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi alphabet (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

Uuu, good catch. Have you tried other language-specific characters like ü, etc.?

barhanc · 2026-05-19T18:48:30Z

> I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi alphabet (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

> Uuu, good catch. Have you tried other language-specific characters like ü, etc.?

Yeah, I tried the same for German-, French-, and Spanish-specific characters and there wasn't any problem.

IgorSwat · 2026-05-19T19:06:52Z

> I've tested the Speech app on iOS and standard usage is fine. There are, however, some lurking bugs: in the Text to Speech app, trying to generate speech using 'AF Heart' from text written in the Hindi alphabet (e.g. 'शुभ प्रभात') crashes the whole app with `offset + count out of range` at `audio = audio.subspan(0, lastTokenTimestamp);` in `rnexecutorch::models::text_to_speech::kokoro::Kokoro::synthesize`.

Fixed.

msluszniak · 2026-05-19T19:08:37Z

@IgorSwat, inspired by Bartek's finding, I'm trying some other attacks to expose further problems. I'll come back with my findings.

msluszniak · 2026-05-19T19:40:47Z

TTS edge-case findings from stress testing

Ran a battery of inputs against Kokoro::generate (via forward()) and the streaming path on Android. Bartek's exact crash is no longer reproducible — e173e9d94 fixes it. The following are still open.

1. `speed` parameter has no validation

| `speed` | result |
| --- | --- |
| `0` | throws bare `std::exception` (no message, no error code) |
| `NaN` / `Infinity` / `-1` / `1e9` | silently accepted; emits ~5500 samples regardless of text |
| `1e-6` | emits 150 945 samples (≈6.3 s) for `"Hello world"` |

1e-6 is the most worrying — audioLength = kTicksPerDuration * effectiveDuration is int32_t (Kokoro.cpp:349); a small enough speed overflows that and the synthesizer allocates unbounded memory. Suggested guard: reject non-finite or ≤ 0 speeds at the JS boundary and in Kokoro::generate, with a real RnExecutorchError(InvalidUserInput, …).
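
A guard of this kind at the JS boundary might look like the sketch below. It is only an illustration: `sanitizeSpeed`, `InvalidSpeedError`, and the default bounds `0.5` and `2.0` are hypothetical names and values, not the library's API, and the real code would presumably throw its own `RnExecutorchError` instead. The sketch rejects non-finite and non-positive speeds outright and clamps everything else into a bounded range, so that inputs such as `1e-6` can no longer inflate the requested audio length.

```typescript
// Hypothetical error type, for illustration only. The real code base would
// use its own error class (the review mentions RnExecutorchError).
class InvalidSpeedError extends Error {
  constructor(speed: number) {
    super(`Invalid speed: ${String(speed)}. Expected a finite number > 0.`);
    this.name = 'InvalidSpeedError';
  }
}

// Rejects NaN, +/-Infinity, zero and negative values, then clamps the rest.
// The default bounds are placeholders, not values taken from the library.
function sanitizeSpeed(speed: number, min = 0.5, max = 2.0): number {
  if (!Number.isFinite(speed) || speed <= 0) {
    throw new InvalidSpeedError(speed);
  }
  return Math.min(max, Math.max(min, speed));
}
```

With these placeholder bounds, `sanitizeSpeed(1e-6)` yields `0.5` and `sanitizeSpeed(1e9)` yields `2.0`, while `0`, `-1`, `NaN`, and `Infinity` throw with a descriptive message.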

2. Streaming worker hangs on non-EOS content

Kokoro.cpp:171-189:

size_t chunkSize = (eosIt != inputTextBuffer_.rend())
                       ? std::distance(eosIt, inputTextBuffer_.rend())
                       : 0;

if (chunkSize > 0 ||
    streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, chunkSize);   // chunkSize still 0
  inputTextBuffer_.erase(0, chunkSize);            // erases nothing
  streamSkippedIterations = 0;                     // reset, loop forever
}

When streamInsert content has no end-of-sentence character, the buffer never drains. The skip-threshold force-flush path fires but uses chunkSize=0 to extract the chunk, so it produces an empty input and resets the counter. streamStop(false) then waits forever for the buffer to empty. streamStop(true) is the only recovery.

Repros: streamInsert('a'), streamInsert('hello world'), 2000× U+200D — all permanently hang the worker.

Suggested fix:

if (chunkSize > 0) {
  // normal flush by EOS
} else if (streamSkippedIterations >= params::kStreamMaxSkippedIterations) {
  input = inputTextBuffer_.substr(0, searchLimit);
  inputTextBuffer_.erase(0, searchLimit);
  streamSkippedIterations = 0;
} else {
  streamSkippedIterations++;
}
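
To make the three branches easier to reason about, here is a small TypeScript model of the same decision. It is not the C++ implementation: `nextChunk`, `FlushResult`, and the `EOS_CHARS` set are hypothetical, and `searchLimit` is assumed to be the number of leading buffer characters scanned for an end-of-sentence character.

```typescript
// Assumed end-of-sentence characters; the real set in Kokoro.cpp may differ.
const EOS_CHARS = '.!?';

interface FlushResult {
  chunk: string; // text handed to the synthesizer ('' means "nothing yet")
  rest: string; // text left in the buffer
  skipped: number; // updated skipped-iterations counter
}

function nextChunk(
  buffer: string,
  skipped: number,
  maxSkipped: number,
  searchLimit: number
): FlushResult {
  const head = buffer.slice(0, searchLimit);

  // Length of the prefix ending at the last EOS character, or 0 if none.
  let chunkSize = 0;
  for (let i = head.length - 1; i >= 0; i--) {
    if (EOS_CHARS.indexOf(head.charAt(i)) !== -1) {
      chunkSize = i + 1;
      break;
    }
  }

  if (chunkSize > 0) {
    // Normal flush: everything up to and including the last EOS character.
    return { chunk: buffer.slice(0, chunkSize), rest: buffer.slice(chunkSize), skipped: 0 };
  }
  if (skipped >= maxSkipped) {
    // Forced flush: no EOS for too long, so drain the scanned prefix.
    return { chunk: head, rest: buffer.slice(head.length), skipped: 0 };
  }
  // No EOS yet and the threshold is not reached: keep waiting.
  return { chunk: '', rest: buffer, skipped: skipped + 1 };
}
```

In this model `nextChunk('hello world', 3, 3, 100)` drains the whole buffer instead of returning an empty chunk. The trade-off is that a forced flush can cut text mid-sentence when the producer is merely slow, so the threshold has to be tuned against how fast text arrives.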

3. `streamStop(true)` drops in-flight audio silently

Kokoro.cpp:137-145:

auto nativeCallback = [this, callback](const std::vector<float> &audioVec) {
  if (this->isStreaming_) {        // false after streamStop(true)
    this->callInvoker_->invokeAsync(...);
  }
};

If streamStop(true) lands while a chunk is mid-synthesis, the synthesizer finishes the chunk and then the callback no-ops — the audio is generated and discarded with no signal. In a captioning / live-narration context that's a silently lost sentence.

Suggested fix: deliver the chunk that completed before the stop, or surface "aborted with in-flight chunk discarded" through onEnd/onCancel.
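
The second option (surfacing the discarded chunk) could be modelled roughly as below. This is a sketch under assumptions: `createStreamGate`, `StreamCallbacks`, and `onDiscard` are hypothetical names, and the real signalling would have to go through whatever `onEnd`/`onCancel` mechanism the module exposes.

```typescript
interface StreamCallbacks {
  onChunk: (audio: Float32Array) => void;
  // Hypothetical hook: invoked when a finished chunk is dropped because the
  // stream was force-stopped while that chunk was being synthesized.
  onDiscard?: (droppedSamples: number) => void;
}

function createStreamGate(callbacks: StreamCallbacks) {
  let streaming = true;
  return {
    // Models streamStop(true): later chunks are no longer delivered.
    stop(): void {
      streaming = false;
    },
    // Called by the synthesizer whenever a chunk is complete.
    deliver(audio: Float32Array): void {
      if (streaming) {
        callbacks.onChunk(audio);
      } else if (callbacks.onDiscard) {
        // Report the loss instead of silently dropping the audio.
        callbacks.onDiscard(audio.length);
      }
    },
  };
}
```

A caller that cares about lost sentences (captioning, live narration) can then log or retry from `onDiscard`, while callers that do not pass it keep the current behaviour.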

4. Observational (optional): solo punctuation produces ~0.5 s of audible artifacts

Single-character inputs like ., !, ?, ... produce 12k–17k samples of mostly-silent audio that contains low-amplitude artifacts the model emits while filling the duration predictor's window. stripAudio's silence threshold doesn't catch them, so the user hears a faint click/breath. Not a crash, not strictly wrong — flagging as observed behavior in case it's worth a guard later (e.g. early-return when all content phonemes are punctuation).
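
If such a guard were ever added, a text-level approximation might look like this sketch. Note the assumptions: the suggestion above is about phonemes, whereas this check runs on the raw input text, and both `isPunctuationOnly` and the character set in the regular expression are hypothetical rather than derived from Kokoro's vocabulary.

```typescript
// True when the input contains nothing but whitespace and common punctuation.
// The character class is an illustrative guess, not Kokoro's punctuation set.
function isPunctuationOnly(text: string): boolean {
  return /^[\s.,!?;:'"()\-…]*$/.test(text);
}

// Hypothetical wrapper: skip synthesis entirely for punctuation-only input.
function synthesizeOrSkip(
  text: string,
  synthesize: (t: string) => Float32Array
): Float32Array {
  return isPunctuationOnly(text) ? new Float32Array(0) : synthesize(text);
}
```

Returning an empty buffer matches the "returned 0 samples" no-op case described in the log decoder further down.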

Reproducing

Stress-test version of the speech app lives on branch @ms/tts-stress-tests — the only changed file is apps/speech/screens/TextToSpeechScreen.tsx (preset chip rows + force-stop button + a switch from model.stream() to model.forward() so the hook's silent period-append doesn't mask the actual model behavior).

git fetch origin
git checkout @ms/tts-stress-tests
yarn install
cd apps/speech && yarn android
adb logcat -c && adb logcat ReactNativeJS:V AndroidRuntime:E DEBUG:V libc:E '*:F'

Open the app → Text To Speech screen.

Top row — "Test presets" drives forward() (one-shot synthesis, no streaming wrapper):

space, spaces, newline, dot, excl, q, ... — punctuation/empty edge cases (relates to finding 4)
Hindi, Arabic, Chinese, Japanese, Hebrew, Russian, Korean — script-mismatch with English voice (Bartek's family ;p)
emoji, emoji-mix, ZW-chars, NUL, EN+Hindi, diacritics — non-vocab / mixed-script
speed=0, speed=NaN, speed=Inf, speed=-1, speed=1e-6, speed=1e9 — drives finding 1
noPh:EN, noPh:nums, noPh:syms — phonemize: false with non-phoneme input

Bottom row — "Streaming tests" drives streamInsert + stream() directly (bypassing the hook's . append):

no-term:a, no-term:long — finding 2; these will hang the worker (tap the red Force stop button to recover)
many-EOS — sanity check (multiple sentences in one insert)
insert-flood-EOS — concurrency / buffer growth under load (no race observed)
race:stop-during-synth — finding 3
race:insert-during-synth — sanity check (no data loss observed)

Tap Force stop any time a streaming test hangs — it calls streamStop(true) so you can keep testing.

Interpreting the logs. Each tap emits one of:

I ReactNativeJS: [TTS-test]   text=<json> speed=<n> phonemize=<bool>
I ReactNativeJS: [TTS-test]   forward() returned <N> samples
I ReactNativeJS: [TTS-test]   threw: <message>

I ReactNativeJS: [TTS-stream] start: <label>
I ReactNativeJS: [TTS-stream] <label> chunk #<n>: <N> samples (t=<ms>)
I ReactNativeJS: [TTS-stream] end: <label> — <chunks> chunks, <samples> samples, <ms>ms
I ReactNativeJS: [TTS-stream] threw: <label> -> <message>

Quick decoder:

returned 0 samples → safe no-op (input had no usable phonemes).
returned <large N> with weird input → check whether the model produced unintended audio (findings 1 / 4).
threw: std::exception with no detail → wrap with proper RnExecutorchError somewhere upstream.
start: <label> followed by no chunks and a multi-minute duration → streaming worker is hung (finding 2); you need Force stop.
start: <label> followed by no chunks but quick end (≲1 s) → in-flight chunk dropped (finding 3) — synthesizer ran, callback no-op'd.
chunk #N lines mean audio was delivered to JS. The streaming sanity tests (many-EOS, insert-flood-EOS, race:insert-during-synth) should each produce several chunks.

msluszniak

Comment above

IgorSwat · 2026-05-20T08:12:21Z

@msluszniak

1. I added checks for the speed parameter.
2. I've known about this issue for a long time. For normal (one-shot) inputs it never happens, because we always add a dot '.' if no EOS character is present at the end. I once changed it in exactly the same way your "reviewer" suggested, but then the LLM streaming mode often fails: it's very hard to tune the kStreamMaxSkippedIterations parameter so that it doesn't stop the streaming prematurely, because it depends on the LLM generation speed. So fixing it properly is a non-trivial task, and I don't think we have time for that.
3. Harmless issue. I don't see a point in complicating the API further for something like that.
4. Another harmless issue. I don't see a point in covering all the "stupid" inputs from the user, unless they result in a direct crash.

msluszniak · 2026-05-20T08:21:26Z

I agree that 4 is rather good to skip.
Regarding 3, ok I just wanted to be sure you are aware of this buggy behaviour.

Regarding 2:

> cause we always add a dot '.'

Where do we add this dot?

IgorSwat · 2026-05-20T08:26:01Z

> Where do we add this dot?

useTextToSpeech.ts, lines 107 to 112.
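
For readers without the file open, the workaround described here amounts to something like the following sketch. It is not the code from `useTextToSpeech.ts`: `ensureTerminated` and the `SENTENCE_END` set are hypothetical, and whether the hook trims trailing whitespace first is an assumption.

```typescript
// Assumed end-of-sentence characters; the hook's actual set may differ.
const SENTENCE_END = '.!?';

// Appends '.' when the text does not already end with an EOS character.
function ensureTerminated(text: string): string {
  const trimmed = text.replace(/\s+$/, '');
  if (trimmed.length === 0) {
    return trimmed;
  }
  const last = trimmed.charAt(trimmed.length - 1);
  return SENTENCE_END.indexOf(last) !== -1 ? trimmed : trimmed + '.';
}
```

Because this lives in the hook, callers that use the module directly bypass it, which is the gap raised in the next comments.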

msluszniak · 2026-05-20T08:40:56Z

> Where do we add this dot?

> useTextToSpeech.ts, lines 107 to 112.

Ok, what about textToSpeechModule? I can't see a similar hack there.

IgorSwat · 2026-05-20T08:45:23Z

> Ok, what about textToSpeechModule? I cannot see the similar hack.

We can't do that in textToSpeechModule. And I honestly do not see a point in doing so, if it is already done in the hook.

msluszniak · 2026-05-20T08:47:45Z

Hmmm, the very minimum we need to do is to escalate this into a separate issue, since it is a serious problem. After the 0.9 release I will work on a solution that both solves this issue and does not break the LLM integration.

And by the way, the hook is a completely separate mechanism. If we had this hack in the module that is reused by the hook, then ok. But the other way around, I disagree.

msluszniak

As I said, either you or I need to work on the EOS issue, but apart from that I have nothing else to add. Great job overall! :))

IgorSwat requested review from chmjkb and msluszniak May 8, 2026 14:24

IgorSwat force-pushed the @is/multilingual-tts branch from 8380a2a to eb999a7 Compare May 8, 2026 14:26

IgorSwat self-assigned this May 8, 2026

IgorSwat added the feature and improvement labels May 8, 2026

IgorSwat changed the title ~~feat: multilingual text-to-speech~~ feat!: multilingual text-to-speech May 8, 2026

msluszniak requested changes May 8, 2026

msluszniak reviewed May 11, 2026

chmjkb requested changes May 12, 2026

msluszniak linked an issue May 18, 2026 that may be closed by this pull request

Text to Speech - add new languages support #712

Closed

5 tasks

barhanc and others added 18 commits May 18, 2026 17:25

build: update ios libs

e6b59cd

Link phonemis with Android build

717c16c

Link phonemis with iOS build

8c74d42

Adjust typescript API to new tts structure

3609385

Fix model picker & adjust to new Phonemis API

b4a27e3

Add spanish

ad3c76a

Add italian

5273507

Add basic polish, portugese and hindi

847d352

Partitioner refactor

b43b75b

Adjust Kokoro API to new Partitioner

7614cca

Adjust native/JS API

eca179e

Native side refactor

249440d

Implement dynamic phonemization

dd297d6

Silence tsconfig warnings

088dfd5

Introduce finetuned Kokoro

b9d736a

Add audio volume up

4b0e928

Improve audio trimming algorithm

c3b4d9f

Change the typescript API to allow custom model weights bundled with …

c356673

…voice

msluszniak and others added 3 commits May 19, 2026 11:31

Update t2s tests

1ad23ea

Merge branch '@is/multilingual-tts' of https://github.com/software-ma…

38340f6

…nsion/react-native-executorch into @is/multilingual-tts

IgorSwat force-pushed the @is/multilingual-tts branch from 10e8e1c to 38340f6 Compare May 19, 2026 11:32

msluszniak approved these changes May 19, 2026

barhanc self-requested a review May 19, 2026 15:18

barhanc reviewed May 19, 2026

IgorSwat added 2 commits May 19, 2026 21:04

Fix audio index out of bounds bug

e173e9d

Merge branch '@is/multilingual-tts' of https://github.com/software-ma…

065c264

…nsion/react-native-executorch into @is/multilingual-tts

msluszniak requested changes May 20, 2026

msluszniak reviewed May 20, 2026

Comment thread packages/react-native-executorch/common/rnexecutorch/models/text_to_speech/kokoro/Kokoro.cpp Outdated

Add clamping speed parameter

81daf0d

IgorSwat force-pushed the @is/multilingual-tts branch from 81c1766 to 81daf0d Compare May 20, 2026 08:14

msluszniak mentioned this pull request May 20, 2026

Kokoro::stream hangs on non-EOS-terminated buffer; streamStop(false) never returns #1153

Open

msluszniak approved these changes May 20, 2026

chmjkb approved these changes May 20, 2026

IgorSwat merged commit 9f752b6 into main May 20, 2026
5 checks passed

IgorSwat deleted the @is/multilingual-tts branch May 20, 2026 09:17
